Executive Summary

In 2006, Elon Musk wrote a blog post titled The Secret Tesla Motors Master Plan (just between you and me), in which he laid out the company’s strategy. In the post, he offered a multi-step plan to achieve the company’s stated mission:

Tesla’s mission is to accelerate the world’s transition to sustainable energy.

The plan detailed four steps that would eventually lead to what is known as the Tesla Model 3, the first affordable, high-performance, no-compromise electric car. As of March 2020, the Model 3 had become the all-time best-selling plug-in electric car, surpassing the Nissan LEAF, and it accomplished this in just 2.5 years versus ten years for the LEAF [1].

As the plan noted, the goal was to build a more affordable car, accessible to more people than the previous premium-market products. The Model 3 launched at a $35,000 price point, making it competitive with entry-level German vehicles. It was met with an incredible reception, garnering 200,000 pre-orders in the first 24 hours after its launch [2]. It has since sold 500,000 units and continues to be loved.

However, while the company is doing well, with its stock at all-time highs and soon to enter the S&P 500 [3], it has seen its share of quality issues. In June 2020, J.D. Power released its annual quality study, which ranked Tesla last among 32 automotive brands [4]. Bloomberg surveyed 5,000 Model 3 owners, who submitted details of their quality issues, and published the results in October 2019 [5]. Owners stated that the most significant problems were with paint and panel gaps. While the report found that defects were cut in half over time, Tesla is still working to optimize its production.

This report looks at user discussions, mostly from 2020, on the Tesla Model 3 Discussion Forums to surface what is top of mind and what issues might still be affecting the Model 3. The forums are an open place where people can post topics, ask questions, or generally participate in the community. User forums are rich with information and can give a view into customer sentiment that differs from social media or traditional surveys.

The four-step master plan:

  1. Build sports car
  2. Use that money to build an affordable car
  3. Use that money to build an even more affordable car
  4. While doing above, also provide zero emission electric power generation options

Tesla Model 3 Performance

# plotting and pipes
library(tidyverse)
library(stringr)
library(tidyr)

# text mining library
library(tm)
library(tidytext)
library(wordcloud)
library(reshape2)
library(textstem)
library(ggraph)
library(igraph)
library(widyr)
library(spacyr)
library(SnowballC)
library(topicmodels)
library(quanteda)
library(seededlda)
library(parallel)
library(ldatuning)

# date/time library
library(lubridate)

# Read in the tesla forum data
df <- read.csv('tesla_forums.csv')

# Adjust variable types
df$Time <- as_datetime(df$Time)
df$User <- as.factor(df$User)
df$Topic <- as.factor(df$Topic)
# Drop a small amount of rows with NA values
df <- drop_na(df)
# Removed all duplicates.  The scraping method used created quite a few.
df <- distinct(df)
# Remove the first topic, it's just the "how to use the forums" thread and doesn't aid in analysis
df <- df[-c(1:24), ]
# Add Doc_Id incrementing per Row
df <- df %>%
  mutate(doc_id = paste0("doc", row_number())) %>%
  select(doc_id, everything())
# Add a Column for Text Length
df$text_len <- str_count(df$Discussion)

Exploratory Data Analysis

Perform an Exploratory Data Analysis (EDA) to better understand the characteristics, extent, and shape of our data.

df %>%
  select(Discussion, Time, text_len) %>%
  summary()
##   Discussion             Time                        text_len     
##  Length:54311       Min.   :2015-12-10 19:16:17   Min.   :   1.0  
##  Class :character   1st Qu.:2019-10-04 08:41:09   1st Qu.:  86.0  
##  Mode  :character   Median :2020-04-03 16:40:26   Median : 179.0  
##                     Mean   :2020-01-26 01:41:25   Mean   : 278.4  
##                     3rd Qu.:2020-07-29 17:58:40   3rd Qu.: 348.0  
##                     Max.   :2020-12-15 02:19:29   Max.   :7944.0
# Make a copy of the original DF so it can be referenced later.
df_select <- df

Summary

  • Discussions: There are a total of 54,311 discussion posts in this dataset after removing duplicates. Each one is essentially like a comment on a Facebook post: a Topic (not shown) is posted, and Discussions happen on that topic.
  • Time: Dates range from 2015-12-10 to 2020-12-15. The median, mean, and 3rd quartile are all in 2020, telling us that most of the posts in this set are from 2020.
  • Text_Len: The minimum text length is 1 character and the maximum is 7,944 characters, with a median of 179.

Topic & User Information

The data is stored so that the topic title is repeated on each discussion row. We therefore need to group the rows and aggregate them into counts for each unique topic. This also lets us see how many discussions each topic has.

df_topics <- df_select %>%
  group_by(Topic) %>%
  summarise(count = n(), .groups="keep") %>%
  arrange(desc(count))
head(df_topics)
df_topics %>%
  ggplot(aes(count)) + 
  geom_histogram(fill="lightgray", color="gray", bins=30) +
  theme_minimal() +
  scale_y_log10() +
  labs(x = "Number of Discussions per Topic",
       y = "Count (Log10 Scale)",
      title = "Distributions of Discussions per Topic",
      subtitle = "Number of replies per unique thread"
      ) +
  theme(plot.title = element_text(face = "bold"))

The number of Discussions per Topic follows a heavily right-skewed distribution, with bins of roughly 500-1,000 topics that have 0-25 discussions each. Beyond about 25 replies per topic (x-axis), only a few topics remain. Two topics have more than 75 discussions, as noted in the table above.

Total Topics

sprintf("There are %s unique topics", nrow(df_topics))
## [1] "There are 3676 unique topics"
df_users <- df_select %>% 
  group_by(User) %>%
  summarise(count = n(), .groups="keep") %>% 
  arrange(desc(count))
head(df_users, n=10)

The forums are kept quite active by a range of users; 8 users have over 1,000 posts in this dataset.

df_users %>%
  ggplot(aes(count)) + 
  geom_histogram(fill="lightgray", color="gray", bins=30) +
  theme_minimal() +
  scale_y_log10() +
  labs(x = "Number Posts",
       y = "Count (Log10 Scale)",
      title = "Distributions of Active Users",
      subtitle = "Number of unique entries per user name"
      ) +
  theme(plot.title = element_text(face = "bold"))

A large number of users (1,500+) have a very small number of posts. A small number are extremely active on the forums, with more than 500 posts.

Total Users

sprintf("There are %s total unique users", nrow(df_users))
## [1] "There are 6548 total unique users"

Text Length Analysis

summary(df_select$text_len)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    86.0   179.0   278.4   348.0  7944.0

Discussion lengths for the dataset range from 1 to 7,944 characters, with a median of 179 and a mean of 278.

df_select %>%
  ggplot(aes(text_len)) + 
  geom_histogram(fill="lightgray", color="gray", bins=30) +
  theme_minimal() +
  scale_y_log10() +
  labs(x = "Text Length",
       y = "Count (Log10 Scale)",
      title = "Distributions of Text Length",
      subtitle = "Per character counts of the replies to topics"
      ) +
  theme(plot.title = element_text(face = "bold"))

Text length is right-skewed as well, with most posts being short, but the distribution is spread much further along the right tail.

Discussion Frequency

df %>%
  mutate(date = floor_date(Time, "week")) %>%
  group_by(date) %>%
  summarize(count = n(), .groups = 'keep') %>%
  
  
  ggplot(aes(date, count)) +
  geom_line(show.legend = FALSE) +
  theme_minimal() +
  labs(
    x = NULL,
    y = "Frequency",
    title = "Number of Discussion Posts per Week",
    subtitle = "Total count of comments/replies per week"
  ) +
  theme(plot.title = element_text(face = "bold"))

When viewing the frequency of posts over time, the data goes back to 2016 but activity jumps at the start of 2020. Because the posts were scraped from the forums starting with the newest and working backwards, it is expected that the data is weighted toward the current year. There is a dip in posts around mid-2020; this is most likely an error in scraping rather than a lack of activity on the forum. Since this analysis focuses mostly on text, the dip is not critical to understand.

Outlier Analysis

df_select %>%
  filter(text_len > 5000) %>%
  select(Discussion) %>%
  head(n=1)

Note: Since this is discussion forum text, outliers are simply long posts as demonstrated above. They will remain in the dataset since longer text often contains valuable information.

Text Cleaning

To better machine-analyze the text extracted from the forum, standard text cleaning is performed to normalize the text. Additionally, the text is lemmatized, transforming words to their lemma, or base word. We will not remove numbers in this operation since we are focused on the Model 3, containing a number in its name.

# Drop non-ASCII characters
df_select$Discussion <- iconv(df_select$Discussion, "latin1", "ASCII", sub = "")
# Replace newlines, "=", and "-" with spaces; drop "@"
df_select$Discussion <- str_replace_all(df_select$Discussion, "\\n", " ")
df_select$Discussion <- str_replace_all(df_select$Discussion, "@", "")
df_select$Discussion <- str_replace_all(df_select$Discussion, "=", " ")
df_select$Discussion <- str_replace_all(df_select$Discussion, "-", " ")
# Remove URLs
df_select$Discussion <- gsub("http[[:alnum:][:punct:]]*", "", df_select$Discussion)
# Normalize punctuation, whitespace, and case; remove stop words; lemmatize
df_select$Discussion <- removePunctuation(df_select$Discussion)
df_select$Discussion <- stripWhitespace(df_select$Discussion)
df_select$Discussion <- tolower(df_select$Discussion)
df_select$Discussion <- removeWords(df_select$Discussion, c(stopwords('english')))
df_select$Discussion <- lemmatize_strings(df_select$Discussion)
# One row per word
tidy_df <- df_select %>%
  unnest_tokens(word, Discussion)
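
To make the effect of these steps concrete, here is a minimal base-R sketch that applies a similar sequence to a single made-up post. The sample text, the `clean_post` helper, and the tiny stop-word list are all assumptions for illustration; the report itself relies on tm's stop-word list and textstem's lemmatizer, neither of which is reproduced here.

```r
# Hypothetical forum post, invented for illustration.
post <- "Check https://example.com - my Model 3's range dropped!\nIs that normal?"

# Tiny stand-in stop-word list (the report uses tm::stopwords('english')).
stop_words_small <- c("my", "is", "that", "the", "a")

# Hypothetical helper mirroring the order of the main cleaning steps.
clean_post <- function(x, stops) {
  x <- gsub("http[[:alnum:][:punct:]]*", "", x)  # drop URLs
  x <- gsub("\n", " ", x, fixed = TRUE)           # newlines to spaces
  x <- gsub("[[:punct:]]", "", x)                 # drop punctuation, keep digits
  x <- tolower(x)
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stops)]
  paste(words, collapse = " ")
}

cleaned <- clean_post(post, stop_words_small)
print(cleaned)
# [1] "check model 3s range dropped normal"
```

Note how the digit in "Model 3" survives, and how removing punctuation before stop words fuses the possessive into "3s". The same ordering would turn a contraction such as "don't" into "dont", which a stop-word list containing "don't" would then no longer match.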

Sentiment Analysis

Sentiment analysis is the process of systematically identifying the emotion of different words in a text corpus. There are several methods available, from text-based lexicon lookups to more advanced machine learning-based models that consider sentence structure. For this exercise, we’ll examine the text through various lexicon-based methods.

Bing Sentiment Lexicon

The Bing lexicon, from Bing Liu and collaborators, adds a sentiment column that marks each word as positive or negative.

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

bing_df <- tidy_df %>%
  inner_join(get_sentiments("bing"), by = "word")
bing_df %>%
  group_by(sentiment) %>%
  summarise(count = n(), .groups = "keep")

Based on a pure lookup, the text in the forum is positive overall, with ~83k positive values and ~69k negative values.
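
As a quick sanity check on how strong that lean is, the approximate counts can be turned into a share and a ratio. The figures below are the rounded values quoted in the text, so the result is only indicative.

```r
# Approximate counts quoted in the text (rounded to the nearest thousand).
positive <- 83000
negative <- 69000

positive_share <- positive / (positive + negative)
pos_neg_ratio  <- positive / negative

round(positive_share, 3)  # 0.546
round(pos_neg_ratio, 2)   # 1.2
```

In other words, roughly 55% of the matched words are positive, a modest rather than overwhelming majority.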

AFINN scoring Lexicon

The AFINN lexicon, from Finn Årup Nielsen, adds a value column with a numeric representation of how positive or negative each word is. Scores range between -5 and 5.

http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html

afinn_df <- tidy_df %>%
  inner_join(get_sentiments("afinn"), by = "word")

head(afinn_df)
afinn_df %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 10, show.legend = FALSE, fill="lightgray", color="darkgray") +
  scale_x_continuous(breaks = c(-5, -3, -1, 1, 3, 5)) +
  theme_minimal() +
  scale_colour_grey(start = 0.3, end = .8) +
  labs(
    x = NULL,
    y = NULL,
    title = "Distribution of AFINN Sentiment Scores by Value",
    subtitle = "Count of occurrences of each score value"
  ) +
  theme(plot.title = element_text(face = "bold"))

For the dataset overall, there is a slight left skew, showing a greater concentration of words with positive values. Very few words fall at the extreme values (-5, -4, +5).

Note: 0 is not a valid value in this scoring system, so that bin is empty.
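
The scoring mechanics can be sketched in base R with a toy lexicon. The words and scores in `mini_afinn` are invented for illustration and are not the real AFINN values; `merge()` stands in for the `inner_join()` used in the report.

```r
# Hypothetical mini-lexicon in the AFINN style (integer scores, no zero).
mini_afinn <- data.frame(
  word  = c("love", "great", "issue", "problem", "broken"),
  value = c(3, 3, -1, -2, -1),
  stringsAsFactors = FALSE
)

# Hypothetical tokenized posts, one row per word.
toy_tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  word   = c("love", "the", "range", "broken", "issue"),
  stringsAsFactors = FALSE
)

# Inner join: words missing from the lexicon ("the", "range") drop out.
scored <- merge(toy_tokens, mini_afinn, by = "word")

# Mean score per document.
doc_scores <- aggregate(value ~ doc_id, data = scored, FUN = mean)
print(doc_scores)
```

Only three of the five tokens receive a score, which is why lexicon-based results describe the matched vocabulary rather than every word in a post.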

NRC Sentiment Lexicon

The NRC Emotion Lexicon, from Saif Mohammad and Peter Turney, is a list of English words and their associations with eight basic emotions as well as positive and negative sentiment.

One thing to note: a single word can be associated with multiple emotions.

nrc_df <- tidy_df %>%
  inner_join(get_sentiments("nrc"), by = "word")

Total counts for all 8 emotions and 2 sentiments.

nrc_df %>%
  group_by(sentiment) %>%
  summarise(total = n(), .groups = "keep") %>%
  arrange(desc(total))
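
Because one word can carry several labels, the join can return more rows than there were tokens. The base-R sketch below shows this with a made-up lexicon; the word-to-emotion pairs in `mini_nrc` are assumptions for illustration, not entries copied from NRC.

```r
# Hypothetical lexicon: one word can map to several emotions.
mini_nrc <- data.frame(
  word      = c("damage", "damage", "damage", "happy", "happy"),
  sentiment = c("negative", "anger", "sadness", "positive", "joy"),
  stringsAsFactors = FALSE
)

toy_words <- data.frame(word = c("damage", "happy", "car"),
                        stringsAsFactors = FALSE)

# Inner join, analogous to inner_join(get_sentiments("nrc"), by = "word").
joined <- merge(toy_words, mini_nrc, by = "word")

nrow(toy_words)  # 3 tokens in
nrow(joined)     # 5 rows out: "car" is dropped, the others repeat per label
table(joined$sentiment)
```

The totals per emotion therefore count word-label pairs, not distinct words, and should be read with that in mind.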

Top Word Sentiment

Top Word Counts (BING)

bing_df %>%
  count(word, sort = TRUE, sentiment) %>%

  group_by(sentiment) %>%
  top_n(15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +
  theme_minimal() +
  labs(x = "Contribution to sentiment",
       y = NULL,
    title = "Top Word Counts",
    subtitle = "Using BING Sentiment Lexicon"
  ) +
  theme(plot.title = element_text(face = "bold"))

Focusing on the negative words, the top occurrence is issue and the second is problem. Given the nature of this forum, which discusses a product, these are very practical words to find at the top of the list: people are reporting or discussing issues and problems with their cars. Bug, noise, break, and damage all feel like natural matches as well. Numb is an interesting occurrence and worth looking into a little more.

Overall Top Words (BING)

bing_df %>%
  count(word, sort = TRUE, sentiment) %>%
  top_n(30) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = TRUE) +
  theme_minimal() +
  labs(x = "Contribution to sentiment", y = NULL,
    title = "Top 30 Sentiment Words",
    subtitle = "Grouped by BING Sentiment Classification"
  ) +
  theme(plot.title = element_text(face = "bold"))

Sentiment over Time (AFINN)

plot_df2 <- afinn_df %>%
  filter(Time > "2020-01-10") %>%
  mutate(mon = floor_date(Time, "day")) %>%
  group_by(mon) %>%
  summarize(value = mean(value), .groups = 'keep')

plot_df2$color <- ifelse(plot_df2$value < 0, "negative","positive")

ggplot(plot_df2, aes(mon, value, fill = color)) +
  geom_col(show.legend = FALSE) +
  theme_minimal() +
  labs(x = NULL, y = "Sentiment",
    title = "Sentiment by Day",
    subtitle = "Calculated by Mean AFINN sentiment score "
  ) +
  theme(plot.title = element_text(face = "bold"))

We can see that over time, the sentiment in the forums is positive as measured by the mean sentiment score by day. This doesn’t mean that there isn’t any negative feedback, just that overall, it’s positive on average.
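
A small base-R sketch shows how a daily mean can hide individual negative words. The dates and scores are invented for illustration, and `aggregate()` stands in for the `group_by()` and `summarize()` calls used above.

```r
# Hypothetical scored words with the day they were posted.
toy_scored <- data.frame(
  day   = as.Date(c("2020-03-01", "2020-03-01",
                    "2020-03-02", "2020-03-02", "2020-03-02")),
  value = c(2, -1, -3, -2, 2)
)

# Mean score per day, then label each day by the sign of its mean.
daily <- aggregate(value ~ day, data = toy_scored, FUN = mean)
daily$color <- ifelse(daily$value < 0, "negative", "positive")
print(daily)
```

The first day averages 0.5 and is labelled positive even though it contains a negative word, while the second day averages -1 despite containing a positive one.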

Word Cloud

Word cloud of the top 200 words grouped by sentiment, positive or negative.

bing_df %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray", "black"), max.words = 200)

Topic Identification

Bi-Grams

bigrams <- df_select %>%
  unnest_tokens(bigram, Discussion, token = "ngrams", n = 2)
bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigram_counts <- drop_na(bigram_counts)
bigram_counts
bigram_graph <- bigram_counts %>%
  filter(n > 300) %>%
  graph_from_data_frame()

set.seed(2017)
a <- grid::arrow(type = "open", length = unit(.05, "inches"))

ggraph(bigram_graph, layout = "nicely") +
  geom_edge_link(arrow = a, end_cap = circle(.02, 'inches')) +
  geom_node_point(color = "gray", size = 2) +
  geom_node_text(aes(label = name), vjust = -1, hjust = 1) +
  theme_minimal()

Some of the key bi-grams that result from the dataset and are more product related are as follows:

  • Sentry - Mode: This is a special feature of the car which records activity outside the car via the cameras it uses for self driving.
  • Speed - Limit: Most likely related to the limits that are possible when using the self driving feature.
  • Steer - Wheel: Steering wheel related feedback.
  • Tesla - App: The mobile app supported on Apple and Android devices.
  • Software - Update: Teslas go through regular software updates, every 1-2 weeks.
  • Take - Delivery: Related to the purchase process.
  • Mile - Range: Being a battery-powered car, range is a highly talked about issue.
  • Battery - Degradation: Similar to range, this concerns whether batteries retain their health.
  • Charge - Port: The inlet for the charging adapter on the car is called the charging port.
  • Service - Center: Related to the location where service is performed.
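
For readers unfamiliar with how such pairs are formed, here is a minimal base-R sketch that builds and counts bi-grams from one made-up, already-cleaned string. The report uses `unnest_tokens(..., token = "ngrams", n = 2)`; the manual pairing below is only an illustration of the same idea.

```r
# Hypothetical cleaned and lemmatized text.
toy_text  <- "sentry mode drain battery sentry mode record video"
toy_words <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# Pair each word with the one that follows it.
toy_bigrams <- data.frame(
  word1 = head(toy_words, -1),
  word2 = tail(toy_words, -1),
  stringsAsFactors = FALSE
)

# Count each distinct pair and sort by frequency.
toy_bigrams$n <- 1
toy_counts <- aggregate(n ~ word1 + word2, data = toy_bigrams, FUN = sum)
toy_counts <- toy_counts[order(-toy_counts$n), ]
print(head(toy_counts, 3))
```

Eight words yield seven adjacent pairs, and "sentry mode" rises to the top because it occurs twice. Since lemmatization runs before pairing, phrases appear in base form, which is why the list shows "Steer - Wheel" rather than "Steering - Wheel".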

Topic Modeling of Topics

After inspecting the most common word pairs used in the discussion forums, specifically in the longer text replies, we’ll next try to identify topics of discussion via Topic Modeling on the “subjects” of each of the topics. This is an unsupervised method that attempts to automatically identify related topics based on the corpus of text.

corpus <- Corpus(VectorSource(df_topics$Topic))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)  # remove punctuation
corpus <- tm_map(corpus, stripWhitespace)    # remove white space
corpus <- tm_map(corpus, removeWords, c(stopwords('english')))
corpus <- tm_map(corpus, lemmatize_strings) # lemmatization
# Manually remove odd characters that frequently appear
corpus <- tm_map(corpus,content_transformer(function(x) gsub("“", " ", x)))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("”", " ", x)))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("’", " ", x)))
corpus <- tm_map(corpus, removeWords, c("tesla", "model", "anyone", 
                                        "car", "get", "work", "use",
                                        "come", "question", "can",
                                        "issue", "now"))
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, .995)
inspect(dtm)
## <<DocumentTermMatrix (documents: 3676, terms: 135)>>
## Non-/sparse entries: 5010/491250
## Sparsity           : 99%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   app autopilot battery charge drive mode new range tire update
##   1365   1         0       0      0     0    0   0     0    0      0
##   1639   0         0       0      0     0    0   0     0    0      0
##   2830   0         0       0      0     0    0   0     0    0      0
##   2866   0         0       0      0     0    0   0     0    0      0
##   2998   0         0       0      0     0    0   0     0    0      0
##   3042   0         0       0      0     0    0   1     0    0      0
##   3051   1         0       0      1     0    0   0     0    0      0
##   3290   1         0       0      0     0    0   0     0    0      0
##   3321   0         0       0      0     0    0   0     0    0      0
##   3327   0         0       0      2     0    0   0     0    0      0
sel_idx <- rowSums(as.matrix(dtm)) > 0
dtm <- dtm[sel_idx, ]
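
The figures in the `inspect()` output can be checked with a little arithmetic. The sketch below also estimates what the 0.995 threshold means in document counts, assuming `removeSparseTerms()` keeps a term only when it occurs in more than N × (1 − sparse) documents; that rule is an assumption about tm's behaviour and is worth confirming against its documentation.

```r
# Values copied from the inspect() output above.
n_docs     <- 3676
n_terms    <- 135
non_sparse <- 5010
sparse     <- 491250

# Every cell of the matrix is either sparse (zero) or non-sparse.
stopifnot(non_sparse + sparse == n_docs * n_terms)

sparsity <- sparse / (n_docs * n_terms)
round(sparsity, 4)  # 0.9899, reported as 99%

# Assumed rule: a term survives if it appears in more than this many documents.
doc_threshold <- n_docs * (1 - 0.995)
round(doc_threshold, 2)   # 18.38
floor(doc_threshold) + 1  # 19 documents at minimum
```

Under that assumption, each of the 135 retained terms appears in at least 19 of the 3,676 topic titles, which explains why short titles can end up with no terms at all and need to be filtered out.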

Topic Modeling

LDA

From Wikipedia:

In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model and belongs to the machine learning toolbox and in wider sense to the artificial intelligence toolbox.

https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
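
The generative story in that description can be simulated in a few lines of base R. This is only a sketch of the model's assumptions, not of how the `LDA()` function fits topics; the vocabulary, the number of topics, the Dirichlet parameters, and the helper names `rdirichlet1` and `generate_doc` are all made up for illustration.

```r
set.seed(42)

# Draw one sample from a Dirichlet distribution via normalized gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

toy_vocab <- c("battery", "range", "charge", "update", "software", "app")
n_topics  <- 2

# Topic-word distributions: one row per topic, each row sums to 1.
topic_word <- t(replicate(n_topics, rdirichlet1(rep(0.5, length(toy_vocab)))))
colnames(topic_word) <- toy_vocab

# Generate one document: pick a topic mixture, then a topic and a word per slot.
generate_doc <- function(n_words, alpha = rep(0.5, n_topics)) {
  theta <- rdirichlet1(alpha)  # document-topic mixture
  z <- sample(seq_len(n_topics), n_words, replace = TRUE, prob = theta)
  vapply(z, function(topic) sample(toy_vocab, 1, prob = topic_word[topic, ]),
         character(1))
}

toy_doc <- generate_doc(8)
print(toy_doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions that most plausibly produced them.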

Determining the Appropriate Number of Topics

sequ <- seq(2, 20, 1) 

result <- FindTopicsNumber(
  dtm,
  topics = sequ,
  metrics = c("CaoJuan2009","Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 123),
  mc.cores = 6
  )

FindTopicsNumber_plot(result)

Note: The “Griffiths2004” metric was causing RStudio to crash and has been omitted.

Top-Level Topics

lda <- LDA(dtm, k = 6, control = list(seed = 1234))
topics <- tidy(lda, matrix = "beta")
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)


top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free", ncol=3) +
    theme_minimal(base_size = 28) + 
    scale_y_reordered()

Topic Analysis

Targeted Dictionary Analysis

After analyzing the most common customer topics, we’re going to look at the sentiment around a number of them. Out of the originally identified topics, we will look at the following:

  • Full Self Driving: Also known as FSD. A much anticipated feature that was rolled out in private beta in 2020. There has been quite a bit of coverage of this from early adopters on YouTube.
  • Software - Update: Teslas go through regular software updates, every 1-2 weeks.
  • Take - Delivery: Related to the purchase process.
  • Mile - Range & Battery - Degradation: Being a battery-powered car, range is a highly talked about issue.

my_corpus <- corpus(df_select$Discussion)  # build a new corpus from the texts

# Tokenize first, then build the document-feature matrix
quant_dfm <- dfm(quanteda::tokens(my_corpus))
quant_dfm <- dfm_trim(quant_dfm, min_termfreq = 4, max_docfreq = 10)

# Reduce the columns to just what's needed
quant_tesla <- select(df_select, doc_id, Discussion, User, Time)

# Quanteda requires the text field to be called "text"
quant_tesla <- quant_tesla %>%
  rename(text = Discussion)

# Create the Corpus
corp_tesla <- corpus(quant_tesla)

# Add columns for Year, Month, and Week Number
corp_tesla$year <- year(corp_tesla$Time)
corp_tesla$month <- month(corp_tesla$Time)
corp_tesla$week <- week(corp_tesla$Time)

# Subset the Corpus for Just 2020
corp_tesla <- corpus_subset(corp_tesla, year >= 2020)
toks_tesla <- quanteda::tokens(corp_tesla, remove_punct = TRUE)
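
One detail worth checking when subsetting on a document variable is whether the variable name is quoted. The sketch below uses base R's `subset()` on a made-up data frame as a stand-in; it assumes `corpus_subset()` evaluates its condition in a similar way, which should be confirmed against the quanteda documentation.

```r
# Hypothetical posts with a year column.
toy_posts <- data.frame(
  id   = 1:4,
  year = c(2018, 2019, 2020, 2020)
)

# Quoted: compares the literal string "year" with 2020. The number is coerced
# to "2020" and the two strings are compared, giving a single TRUE in common
# locales, where digits sort before letters.
always_true <- "year" >= 2020

quoted   <- subset(toy_posts, "year" >= 2020)  # keeps every row
unquoted <- subset(toy_posts, year >= 2020)    # keeps only the 2020 rows

nrow(quoted)    # 4
nrow(unquoted)  # 2
```

A quick `table()` of the year variable after subsetting is an easy way to confirm that only the intended years remain.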

Full Self Driving

# get relevant keywords and phrases
fsd <- c("fsd", "self driving", "autopilot")

# only keep tokens specified above and their context of ±10 tokens
toks_fsd <- tokens_keep(toks_tesla, pattern = phrase(fsd), window = 10)

toks_fsd <- tokens_lookup(toks_fsd, dictionary = data_dictionary_LSD2015[1:2])

# create a document-feature matrix and group it by weeks in 2020
dfmat_fsd_lsd <- dfm(toks_fsd) %>% 
    dfm_group(group = "week", fill = TRUE) 

matplot(dfmat_fsd_lsd, type = "l", xaxt = "n", lty = 1, ylab = "Frequency", 
        main = "Sentiment of Self-Driving/Full Self Driving for 2020")
grid()
axis(1, seq_len(ndoc(dfmat_fsd_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_fsd_lsd)) - 1))
legend("topleft", col = 1:2, legend = c("Negative", "Positive"), lty = 1, bg = "white")

n_fsd <- ntoken(dfm(toks_fsd, group = toks_fsd$week))
plot((dfmat_fsd_lsd[,2] - dfmat_fsd_lsd[,1]) / n_fsd, 
     type = "l", ylab = "Sentiment", xlab = "", xaxt = "n",
     main = "Sentiment of Self-Driving/Full Self Driving for 2020")
axis(1, seq_len(ndoc(dfmat_fsd_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_fsd_lsd)) - 1))
grid()
abline(h = 0, lty = 2)
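
The second plot shows a net sentiment score per week, computed as (positive − negative) / n. The base-R sketch below works through that formula with invented weekly counts. It assumes that after `tokens_lookup()` only dictionary matches remain, so that n equals the sum of the two counts; that is an assumption about the default lookup behaviour.

```r
# Hypothetical weekly counts of dictionary matches near the keywords.
toy_weekly <- data.frame(
  week     = 1:3,
  negative = c(6, 10, 8),
  positive = c(12, 30, 8)
)

# Net sentiment: +1 means all matches positive, -1 all negative, 0 balanced.
toy_weekly$n   <- toy_weekly$negative + toy_weekly$positive
toy_weekly$net <- (toy_weekly$positive - toy_weekly$negative) / toy_weekly$n
print(toy_weekly)
```

Dividing by n matters because busy weeks would otherwise dominate: week 2 has the largest raw gap (20), but its normalized score of 0.5 is directly comparable with week 1's 0.33.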

Battery and Range

# get relevant keywords and phrases
bat <- c("battery", "charge", "range", "degradation")

# only keep tokens specified above and their context of ±10 tokens
toks_bat <- tokens_keep(toks_tesla, pattern = phrase(bat), window = 10)

toks_bat <- tokens_lookup(toks_bat, dictionary = data_dictionary_LSD2015[1:2])

# create a document-feature matrix and group it by weeks in 2020
dfmat_bat_lsd <- dfm(toks_bat) %>% 
    dfm_group(group = "week", fill = TRUE) 

matplot(dfmat_bat_lsd, type = "l", xaxt = "n", lty = 1, ylab = "Frequency",
        main = "Sentiment of Battery/Charging/Range for 2020")
grid()
axis(1, seq_len(ndoc(dfmat_bat_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_bat_lsd)) - 1))
legend("topleft", col = 1:2, legend = c("Negative", "Positive"), lty = 1, bg = "white")

n_bat <- ntoken(dfm(toks_bat, group = toks_bat$week))
plot((dfmat_bat_lsd[,2] - dfmat_bat_lsd[,1]) / n_bat, 
     type = "l", ylab = "Sentiment", xlab = "", xaxt = "n",
     main = "Sentiment of Battery/Charging/Range for 2020")
axis(1, seq_len(ndoc(dfmat_bat_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_bat_lsd)) - 1))
grid()
abline(h = 0, lty = 2)

Software Updates

# get relevant keywords and phrases
sw <- c("software", "update")

# only keep tokens specified above and their context of ±10 tokens
toks_sw <- tokens_keep(toks_tesla, pattern = phrase(sw), window = 10)

toks_sw <- tokens_lookup(toks_sw, dictionary = data_dictionary_LSD2015[1:2])

# create a document-feature matrix and group it by weeks in 2020
dfmat_sw_lsd <- dfm(toks_sw) %>% 
    dfm_group(group = "week", fill = TRUE) 

matplot(dfmat_sw_lsd, type = "l", xaxt = "n", lty = 1, ylab = "Frequency",
        main = "Sentiment of Software Updates for 2020")
grid()
axis(1, seq_len(ndoc(dfmat_sw_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_sw_lsd)) - 1))
legend("topleft", col = 1:2, legend = c("Negative", "Positive"), lty = 1, bg = "white")

n_sw <- ntoken(dfm(toks_sw, group = toks_sw$week))
plot((dfmat_sw_lsd[,2] - dfmat_sw_lsd[,1]) / n_sw, 
     type = "l", ylab = "Sentiment", xlab = "", xaxt = "n",
     main = "Sentiment of Software Updates for 2020")
axis(1, seq_len(ndoc(dfmat_sw_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_sw_lsd)) - 1))
grid()
abline(h = 0, lty = 2)

Purchase Process

# get relevant keywords and phrases
own <- c("delivery", "purchase", "owner")

# only keep tokens specified above and their context of ±10 tokens
toks_own <- tokens_keep(toks_tesla, pattern = phrase(own), window = 10)

toks_own <- tokens_lookup(toks_own, dictionary = data_dictionary_LSD2015[1:2])

# create a document-feature matrix and group it by weeks in 2020
dfmat_own_lsd <- dfm(toks_own) %>% 
    dfm_group(group = "week", fill = TRUE) 

matplot(dfmat_own_lsd, type = "l", xaxt = "n", lty = 1, ylab = "Frequency",
        main = "Sentiment of Purchasing Process for 2020")
grid()
axis(1, seq_len(ndoc(dfmat_own_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_own_lsd)) - 1))
legend("topleft", col = 1:2, legend = c("Negative", "Positive"), lty = 1, bg = "white")

n_own <- ntoken(dfm(toks_own, group = toks_own$week))
plot((dfmat_own_lsd[,2] - dfmat_own_lsd[,1]) / n_own, 
     type = "l", ylab = "Sentiment", xlab = "", xaxt = "n",
     main = "Sentiment of Purchasing Process for 2020")
axis(1, seq_len(ndoc(dfmat_own_lsd)), ymd("2020-01-01") + weeks(seq_len(ndoc(dfmat_own_lsd)) - 1))
grid()
abline(h = 0, lty = 2)

Conclusion

References